Method, Apparatus and Computer Program Product for Audio Coding

ABSTRACT

The invention relates to a method and an apparatus in which samples of at least a part of an audio signal of a first channel and a part of an audio signal of a second channel are used to estimate a time delay between said part of the audio signal of said first channel and said part of the audio signal of said second channel. The method includes windowing the samples; performing a time-to-frequency domain transform; and determining an inter-channel time delay between said part of the audio signal of the first channel and said part of the audio signal of said second channel on the basis of the frequency domain representations. There is also disclosed a method and an apparatus for decoding the encoded samples.

TECHNICAL FIELD

The present invention relates to a method, an apparatus and a computer program product for coding audio signals.

BACKGROUND INFORMATION

Spatial audio processing is the effect of an audio signal originating from an audio source arriving at the left and right ears of a listener via different propagation paths. As a consequence of this effect the signal at the left ear will typically have a different arrival time and signal level from those of the corresponding signal arriving at the right ear. The differences between the arrival times and signal levels are functions of the differences in the paths by which the audio signal travelled in order to reach the left and right ears respectively. The listener's brain then interprets these differences to give the perception that the received audio signal is being generated by an audio source located at a particular distance and direction relative to the listener. An auditory scene therefore may be viewed as the net effect of simultaneously hearing audio signals generated by one or more audio sources located at various positions relative to the listener.

The fact that the human brain can process a binaural input signal in order to ascertain the position and direction of a sound source can be used to encode and synthesise auditory scenes. A typical method of spatial auditory coding may thus attempt to model the salient features of an audio scene by purposefully modifying audio signals from one or more different sources (channels). For headphone use these may be defined as left and right audio signals, which may be collectively known as binaural signals. The resultant binaural signals may then be generated such that they give the perception of varying audio sources located at different positions relative to the listener. A binaural signal typically exhibits two properties not necessarily present in a conventional stereo signal. Firstly, a binaural signal incorporates the time difference between left and right and, secondly, the binaural signal models the so-called "head shadow effect", which results in a reduction of volume for certain frequency bands.

Recently, spatial audio techniques have been used in connection with multichannel audio reproduction. The objective of multichannel audio reproduction is to provide for efficient coding of multi-channel audio signals comprising a plurality of separate audio channels and/or sound sources. Recent approaches to the coding of multichannel audio signals have centred on the methods of parametric stereo (PS), such as Binaural Cue Coding (BCC). BCC typically encodes the multi-channel audio signal by downmixing the input audio signals into either a single ("sum") channel or a reduced number of channels conveying the "sum" signal. The "sum" signal may also be referred to as a downmix signal. In parallel, the most salient inter-channel cues, otherwise known as spatial cues, describing the multi-channel sound image or audio scene are extracted from the input channels and encoded as side information. Both the sum signal and the side information form the encoded parameter set, which can then either be transmitted as part of a communication chain or stored in a store-and-forward type device. Many implementations of the BCC technique employ a low bit rate audio coding scheme to further encode the sum signal. Subsequently, the BCC decoder generates a multi-channel output signal from the transmitted or stored sum signal and spatial cue information. Typically the "sum" signals (i.e. downmix signals) employed in spatial audio coding systems are additionally encoded using low bit rate perceptual audio coding techniques, such as Advanced Audio Coding (AAC) or ITU-T Recommendation G.718, to further reduce the required bit rate.

In stereo coding of audio signals two audio channels are encoded. In many cases the audio channels may have rather similar content at least part of the time. Therefore, compression of the audio signals can be performed efficiently by coding the channels together. This results in an overall bit rate which can be lower than the bit rate required for coding the channels independently.

A commonly used low bit rate stereo coding method is known as parametric stereo coding. In parametric stereo coding a stereo signal is encoded using a mono coder and a parametric representation of the stereo signal. The parametric stereo encoder computes a mono signal as a linear combination of the input signals. The mono signal may be encoded using a conventional mono audio encoder. In addition to creating and coding the mono signal, the encoder extracts a parametric representation of the stereo signal. Parameters may include information on level differences, phase (or time) differences and coherence between the input channels. On the decoder side this parametric information is utilized to recreate the stereo signal from the decoded mono signal. Parametric stereo is an improved version of intensity stereo coding, in which only the level differences between channels are extracted.

Another common stereo coding method, especially for higher bit rates, is known as mid-side stereo, which can be abbreviated as M/S stereo. Mid-side stereo coding transforms the left and right channels into a mid channel and a side channel. The mid channel is the sum of the left and right channels, whereas the side channel is the difference of the left and right channels. These two channels are encoded independently. With accurate enough quantization, mid-side stereo retains the original audio image relatively well without introducing severe artifacts. On the other hand, for good quality reproduced audio the required bit rate remains at quite a high level.

In many cases stereo signals are generated artificially by panning different sound sources to two channels. In these cases there are typically no time delays between the channels, and the signals can be efficiently encoded using, for example, parametric or mid-side coding.

A special case of a stereo signal is a binaural signal. A binaural audio signal may be recorded, for example, by using microphones mounted in an artificial head, with a real user wearing a headset with microphones in close proximity to his/her ears, or by using another real recording arrangement with two microphones close to each other. These kinds of signals can also be artificially generated. For example, binaural signals can be generated by applying suitable head related transfer functions (HRTF) or corresponding head related impulse responses (HRIR) to a source signal. All these discussed signals have one special feature not typically present in generic two-channel audio: both channels contain in principle the same source signals with a different time delay and frequency dependent amplification. The time delay is dependent on the direction of arrival of the sound. In the following, all these kinds of signals are referred to as binaural audio.

One problem is how to reduce the number of bits needed to encode good quality binaural audio. Mid-side stereo coding and parametric stereo coding techniques do not perform well, as they may not take into consideration the time delays between channels. In the case of parametric stereo, the time delay information may be totally lost. Mid-side stereo, on the other hand, may require a high bit rate for good quality binaural signals. For maximum compression with good quality, a binaural-audio-specific coding method should be used.

It is feasible to think that two binaural channels can be efficiently joined into one channel, such as in parametric stereo coding, if the signals can first be time aligned, i.e. the time delays between the channels are removed. Similarly, the time differences can be restored in the decoder. Alternatively, the time aligned signals can be used for improving the efficiency of mid-side stereo coding.

One difficulty in time alignment lies in the fact that the time differences between the channels of an input signal may be different for different time and frequency locations. In addition, there may be several source signals occupying the same time-frequency location. Further, the time alignment has to be performed carefully: if time shifts are applied carelessly, perceptual problems may arise.

SUMMARY OF SOME EXAMPLES OF THE INVENTION

In an example embodiment of the present invention a low complexity frequency domain implementation is introduced for binaural coding. The embodiment comprises dividing the audio spectrum of the audio channels into two or more subbands and selecting the delays for the subbands in each channel. The operations to determine the delays are mainly performed in the frequency domain.

The audio signals of the input channels are digitized to form samples of the audio signals. The samples may be arranged into input frames, for example in such a way that one input frame may contain samples representing a 10 ms or 20 ms long period of the audio signal. Input frames may further be divided into analysis frames, which may or may not be overlapping. The analysis frames are windowed with windows, for example with sinusoidal windows, padded with certain values at one or both ends, and transformed into the frequency domain using a time-to-frequency domain transform. An example of such a transform is the Discrete Fourier Transform (DFT). The values added at the end(s) of overlapping windows enable delay modification without practically any perceptual artifacts. Each channel may be divided into subbands, and for every channel the delay differences between the channels are analysed using a frequency domain method. The subband of one channel is shifted to obtain the best match with the corresponding subband of the other channel. The operations can be repeated for every subband. Either a parametric stereo or a mid-side stereo type implementation can be used for encoding the aligned signals.

On the decoder side, the original delays are restored to the signals. An efficient decorrelation can be performed to improve the spatial image of the synthesized signals.

According to a first aspect of the present invention there is provided a method comprising

-   using samples of at least a part of an audio signal of a first channel and a part of an audio signal of a second channel to estimate a time delay between said part of the audio signal of said first channel and said part of the audio signal of said second channel;

characterized in that the method comprises

-   windowing the samples of said first channel and said second channel by a window function to form an analysis frame of said first channel and an analysis frame of said second channel;
-   performing a time-to-frequency domain transform on the analysis frames to form a frequency domain representation of said part of the audio signal of said first channel and said part of the audio signal of said second channel; and
-   determining an inter-channel time delay between said part of the audio signal of the first channel and said part of the audio signal of said second channel on the basis of the frequency domain representations.

According to a second aspect of the present invention there is provided a method comprising

-   receiving an encoded audio signal of a first channel and an encoded audio signal of a second channel;

characterized in that the method comprises

-   receiving an indication of an inter-channel time delay between said encoded audio signal of the first channel and said encoded audio signal of the second channel;
-   decoding said encoded audio signal of the first channel and said encoded audio signal of the second channel to form decoded samples of the audio signal of the first channel and the audio signal of the second channel;
-   windowing said decoded samples of said first channel and said second channel by a window function;
-   performing a time-to-frequency domain transform on the windowed samples to form a frequency domain representation of said audio signal of said first channel and said audio signal of said second channel;
-   shifting the frequency domain representation of one of said audio signal of said first channel and said audio signal of said second channel on the basis of said indication;
-   performing a frequency-to-time domain transform on the frequency domain representations of said audio signal of said first channel and said audio signal of said second channel to form time domain samples of the audio signal of the first channel and of the audio signal of the second channel; and
-   windowing said time domain samples of said first channel and said second channel by a window function to form a synthesized audio signal of the first channel and a synthesized audio signal of the second channel.

According to a third aspect of the present invention there is provided an apparatus comprising

-   means for using samples of at least a part of an audio signal of a first channel and a part of an audio signal of a second channel to estimate a time delay between said part of the audio signal of said first channel and said part of the audio signal of said second channel;

characterized in that the apparatus comprises

-   means for windowing the samples of said first channel and said second channel by a window function to form an analysis frame of said first channel and an analysis frame of said second channel;
-   means for performing a time-to-frequency domain transform on the analysis frames to form a frequency domain representation of said part of the audio signal of said first channel and said part of the audio signal of said second channel; and
-   means for determining an inter-channel time delay between said part of the audio signal of the first channel and said part of the audio signal of said second channel on the basis of the frequency domain representations.

According to a fourth aspect of the present invention there is provided an apparatus comprising

-   means for receiving an encoded audio signal of a first channel and an encoded audio signal of a second channel;

characterized in that the apparatus comprises

-   means for receiving an indication of an inter-channel time delay between said encoded audio signal of the first channel and said encoded audio signal of the second channel;
-   means for decoding said encoded audio signal of the first channel and said encoded audio signal of the second channel to form decoded samples of the audio signal of the first channel and the audio signal of the second channel;
-   means for windowing said decoded samples of said first channel and said second channel by a window function;
-   means for performing a time-to-frequency domain transform on the windowed samples to form a frequency domain representation of said audio signal of said first channel and said audio signal of said second channel;
-   means for shifting the frequency domain representation of one of said audio signal of said first channel and said audio signal of said second channel on the basis of said indication;
-   means for performing a frequency-to-time domain transform on the frequency domain representations of said audio signal of said first channel and said audio signal of said second channel to form time domain samples of the audio signal of the first channel and of the audio signal of the second channel; and
-   means for windowing said time domain samples of said first channel and said second channel by a window function to form a synthesized audio signal of the first channel and a synthesized audio signal of the second channel.

According to a fifth aspect of the present invention there is provided a computer program product comprising a computer program code configured to, with at least one processor, cause an apparatus to:

-   use samples of at least a part of an audio signal of a first channel and a part of an audio signal of a second channel to estimate a time delay between said part of the audio signal of said first channel and said part of the audio signal of said second channel;

characterized in that the computer program product comprises a computer program code configured to, with at least one processor, cause the apparatus to

-   window the samples of said first channel and said second channel by a window function to form an analysis frame of said first channel and an analysis frame of said second channel;
-   perform a time-to-frequency domain transform on the analysis frames to form a frequency domain representation of said part of the audio signal of said first channel and said part of the audio signal of said second channel; and
-   determine an inter-channel time delay between said part of the audio signal of the first channel and said part of the audio signal of said second channel on the basis of the frequency domain representations.

According to a sixth aspect of the present invention there is provided a computer program product comprising a computer program code configured to, with at least one processor, cause an apparatus to:

-   receive an encoded audio signal of a first channel and an encoded audio signal of a second channel;

characterized in that the computer program product comprises a computer program code configured to, with at least one processor, cause the apparatus to

-   receive an indication of an inter-channel time delay between said encoded audio signal of the first channel and said encoded audio signal of the second channel;
-   decode said encoded audio signal of the first channel and said encoded audio signal of the second channel to form decoded samples of the audio signal of the first channel and the audio signal of the second channel;
-   window said decoded samples of said first channel and said second channel by a window function;
-   perform a time-to-frequency domain transform on the windowed samples to form a frequency domain representation of said audio signal of said first channel and said audio signal of said second channel;
-   shift the frequency domain representation of one of said audio signal of said first channel and said audio signal of said second channel on the basis of said indication;
-   perform a frequency-to-time domain transform on the frequency domain representations of said audio signal of said first channel and said audio signal of said second channel to form time domain samples of the audio signal of the first channel and of the audio signal of the second channel; and
-   window said time domain samples of said first channel and said second channel by a window function to form a synthesized audio signal of the first channel and a synthesized audio signal of the second channel.

The methods according to some example embodiments of the present invention can be used both with mono and stereo core coding. Examples of both of these cases are presented in FIGS. 1 a and 1 b. In the case of a stereo core codec, the binaural encoder only compensates for the delay differences between the channels. The actual stereo codec can in principle be any kind of stereo codec, such as an intensity stereo, parametric stereo or mid-side stereo codec. When a mono core codec is used, the binaural codec generates a mono downmix signal and also encodes the level differences between the channels. In this case the binaural codec can be considered a binaural parametric stereo codec.

The present invention may provide an improved and/or more accurate spatial audio image due to improved preservation of the time difference between the channels, which is useful e.g. for binaural signals. Furthermore, the present invention may reduce the computational load in binaural/multi-channel audio encoding.

DESCRIPTION OF THE DRAWINGS

In the following the invention will be explained in more detail with reference to the appended drawings, in which

FIG. 1 a depicts an example of a system for encoding and decoding audio signals by using a stereo core codec;

FIG. 1 b depicts an example of a system for encoding and decoding audio signals by using a mono core codec;

FIG. 2 depicts an example embodiment of the device in which the invention can be applied;

FIG. 3 depicts an example arrangement for encoding and decoding audio signals according to an example embodiment of the present invention;

FIG. 4 depicts an example of samples arranged in input frames and analysis frames, and

FIG. 5 depicts an example arrangement for encoding and decoding audio signals according to another example embodiment of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

In the following an example embodiment of the apparatuses for encoding and decoding audio signals by utilising the present invention will be described. FIG. 2 shows a schematic block diagram of a circuitry of an exemplary apparatus or electronic device 1, which may incorporate a codec according to an embodiment of the invention. The electronic device may for example be a mobile terminal, user equipment of a wireless communication system, any other communication device, as well as a personal computer, a music player, an audio recording device, etc.

The electronic device 1 can comprise one or more microphones 4 a, 4 b, which are linked via an analogue-to-digital converter 6 to a processor 11. The processor 11 is further linked via a digital-to-analogue converter 12 to loudspeakers 13. The processor 11 is further linked to a transceiver (TX/RX) 14, to a user interface (UI) 15 and to a memory 7.

The processor 11 may be configured to execute various program codes 7.2. The implemented program codes may comprise audio encoding code routines. The implemented program codes 7.2 may further comprise audio decoding code routines. The implemented program codes 7.2 may be stored for example in the memory 7 for retrieval by the processor 11 whenever needed. The memory 7 may further provide a section 7.1 for storing data, for example data that has been encoded in accordance with the invention.

The encoding and decoding code may be implemented in hardware or firmware in embodiments of the invention.

The user interface may enable a user to input commands to the electronic device 1, for example via a keypad 17, and/or to obtain information from the electronic device 1, for example via a display 18. The transceiver 14 enables communication with other electronic devices, for example via a wireless communication network. The transceiver 14 may in some embodiments of the invention be configured to communicate with other electronic devices by a wired connection.

It is to be understood again that the structure of the electronic device could be supplemented and varied in many ways. As an example, there may be additional functional elements in addition to those shown in FIG. 2, or some of the elements illustrated in FIG. 2 may be omitted. As another example, the electronic device may comprise one or more processors and/or one or more memory units, although depicted as a single processor 11 and a single memory unit 7 in FIG. 2.

A user of the electronic device may use the microphones 4 a, 4 b for inputting audio that is to be transmitted to some other electronic device or that is to be stored in the data section 7.1 of the memory 7. A corresponding application has been activated to this end by the user via the user interface 15. This application, which may be run by the processor 11, causes the processor 11 to execute the encoding code stored in the memory 7.

The analogue-to-digital converter 6 may convert the input analogue audio signal into a digital audio signal and provide the digital audio signal to the processor 11. The processor 11 may then process the digital audio signal as described hereafter.

Alternatively, instead of employing the microphones 4 a, 4 b for inputting the audio signal, a digital audio input signal may be pre-stored in the data section 7.1 of the memory 7 and read from the memory for provision to the processor 11.

The resulting bit stream may be provided to the transceiver 14 for transmission to another electronic device. Alternatively, the encoded data could be stored in the data section 7.1 of the memory 7, for instance for a later transmission or for subsequent distribution to another device by some other means, or for a later presentation or further processing by the same electronic device 1.

The electronic device may also receive a bit stream with correspondingly encoded data from another electronic device via the transceiver 14. In this case, the processor 11 may execute the decoding program code stored in the memory. Alternatively, the electronic device may receive the encoded data by some other means, for example as a data file stored in a memory.

In the following an example embodiment of the operation of the device 1 will be described in more detail with reference to FIG. 3. In this example embodiment there are two audio channels 2, 3 from which audio signals will be encoded by a first encoder 8. Without losing generality, the first audio channel 2 can be called the left channel and the second audio channel 3 can be called the right channel. The audio signals of the left and right channels can be formed e.g. by the microphones 4 a, 4 b. It is also possible that the audio signals for the left and right channels are artificially generated from multiple audio sources, such as by mixing signals from different musical instruments into two audio channels or by processing a source signal, for example using suitable HRTF/HRIR, in order to create a binaural signal.

The analog-to-digital converter 6 converts the analog audio signals of the left and right channels into digital samples. These samples S_(L)(t), S_(R)(t) can be stored into the memory 7 for further processing. In the present invention the samples are organized into input frames I1, I2, I3, I4, which can further be organized into analysis frames F1, F2, F3, F4 (FIG. 4), so that one input frame represents a certain part of the audio signal in the time domain. Successive input frames may have equal length, i.e. each input frame contains the same number of samples, or the length of the input frames may vary, wherein the number of samples in different input frames may be different. The same applies to the analysis frames: successive analysis frames may have equal length or their length may vary, wherein the number of samples in different analysis frames may be different. FIG. 4 depicts an example of input and analysis frames which are formed from samples of the audio signals. For clarity, only four input frames and analysis frames per channel are depicted in FIG. 4, but in practical situations the number of input frames and analysis frames can be different.

A first encoder 8 of the device 1 performs the analysis of the audio signals to determine the delay between the channels in a transform domain. The first encoding block 8 uses samples of the analysis frames of both channels in the analysis. The first encoding block 8 comprises a first windowing element 8.1. The first windowing element 8.1 prepares the samples for a time-to-frequency domain transform. The first windowing element 8.1 uses a window which can be constructed e.g. as follows:

$$win(t) = \begin{cases} 0, & t = 0,\ldots,D_{\max}-1 \\ win_c(t-D_{\max}), & t = D_{\max},\ldots,D_{\max}+L-1 \\ 0, & t = D_{\max}+L,\ldots,L+2D_{\max}-1, \end{cases} \qquad (1a)$$

where D_(max) is the maximum delay shift (in samples) allowed, win_(c)(t) is the center window and L is the length (in samples) of the center window. Thus in win(t) there are D_(max) zeroes at both ends and the center window win_(c)(t) in the middle. This means that the samples modified by the center window win_(c)(t) and the zero values at both ends of the window win(t) are entered to a time-to-frequency domain transformer 8.2. The time-to-frequency domain transformer 8.2 produces a set of transform coefficients L(k), R(k) for further encoding. The time-to-frequency domain transformer 8.2 uses, for example, the discrete Fourier transform (DFT) or the shifted discrete Fourier transform (SDFT). Other transform methods which transform time domain samples into the frequency domain can also be used.
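As an illustration of this windowing and transform stage, the following Python sketch builds the window of equation (1a) with the sinusoidal center window of equation (2) and transforms one windowed analysis frame per channel into the DFT domain. The numpy usage, the function name make_window and the frame length and shift values are illustrative assumptions, not part of the described embodiment.

```python
import numpy as np

def make_window(L, d_max):
    """Window of equation (1a): a sinusoidal centre window of length L
    (equation (2)) with d_max zeros padded at both ends."""
    t = np.arange(L)
    win_c = np.sin(np.pi / L * (t + 0.5))      # centre window, equation (2)
    return np.concatenate([np.zeros(d_max), win_c, np.zeros(d_max)])

# Illustrative values: a 320-sample centre window, 32-sample maximum shift.
L, D_MAX = 320, 32
win = make_window(L, D_MAX)                    # length W = L + 2*D_MAX

# Window one analysis frame per channel and transform to the DFT domain.
frame_l = np.random.randn(L + 2 * D_MAX)       # stands in for real left samples
frame_r = np.random.randn(L + 2 * D_MAX)       # stands in for real right samples
L_k = np.fft.fft(win * frame_l)                # L(k)
R_k = np.fft.fft(win * frame_r)                # R(k)
```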

In this example embodiment the overlap of the analysis frames is L/2+2D_(max) samples, i.e. it is over 50%. The next analysis frame starts L/2 samples after the starting instant of the previous analysis frame. In other words, the next analysis frame starts in the middle of the previous input frame. In FIG. 4 this is depicted so that two consecutive analysis frames, e.g. the first analysis frame F1 and the second analysis frame F2, have common samples, i.e. they both utilize some of the samples of the same input frame I1.

Zeroes are used at both ends of the window so that the frequency domain time shift does not cause perceptual artefacts due to samples circularly shifting from the beginning of the frame to the end, or vice versa.

It should be noted here that other values than zeros can also be used to construct the window win(t). As an example, values that are close to zero, or other values that attenuate the respective portion of the windowed signal to an amplitude that is essentially zero or close to zero, can be used instead of zeros. It may also be sufficient to add zeros or other suitable values only to one side of the center window win_(c)(t). For example, the window can be constructed as follows:

$$win(t) = \begin{cases} 0, & t = 0,\ldots,D_{\max}-1 \\ win_c(t-D_{\max}), & t = D_{\max},\ldots,D_{\max}+L-1, \end{cases} \qquad (1b)$$

or as follows:

$$win(t) = \begin{cases} win_c(t), & t = 0,\ldots,L-1 \\ 0, & t = L,\ldots,D_{\max}+L-1, \end{cases} \qquad (1c)$$

In the analysis window according to the equation (1b), the zeros are added only at the beginning of the analysis window. Equally, the zeroes can be added only at the end of the window, as defined by the equation (1c). Furthermore, it is possible to add any suitable number of zeros to both ends of the window as long as the total number of zeroes is equal to or larger than D_(max). With all analysis windows fulfilling this condition, the shifting can be performed in any direction, because with the DFT transform samples which are shifted over the frame boundary appear at the other end of the window. Thus, a generalized form of the analysis window may be defined as follows:

$$win(t) = \begin{cases} 0, & t = 0,\ldots,D_1-1 \\ win_c(t-D_1), & t = D_1,\ldots,D_1+L-1 \\ 0, & t = D_1+L,\ldots,L+D_1+D_2-1, \end{cases} \qquad (1d)$$

where D₁ and D₂ are non-negative integer values and fulfil the condition D₁+D₂ ≥ D_(max).

In windows defined by the equation (1b), (1c) or (1d) the next analysis frame always starts L/2 samples after the starting instant of the previous analysis frame. It is also possible that the window size is not constant but varies from time to time. In this description the length of the current window is denoted as W.

Next, the transform coefficients are input to an analysis block 8.5 in which the delay between the channels is determined for enabling the alignment of the transform coefficients of one audio channel with respect to another audio channel. The operation of the analysis block 8.5 will be described in more detail later in this application.

The transform coefficients of the reference channel and the aligned channel can be encoded by a second encoder 9, which can be, for example, a stereo encoder as depicted in FIG. 1 a or a mono encoder as depicted in FIG. 1 b. The second encoder 9 encodes the channels e.g. by using mid-side coding or parametric coding.

The signal formed by the second encoder 9 can be transmitted by the transmitter 14 to another electronic device 19, for example a wireless communication device. The transmission may be performed e.g. via a base station of a wireless communication network.

It should be noted that it is also possible that the encoded signal, which can be a bitstream, a series of data packets, or any other form of signal carrying the encoded information, is not immediately transmitted to another electronic device but is stored in the memory 7 or in another storage medium. The encoded information can later be retrieved from the memory 7 or the storage medium for transmission to another device or for distribution to another device by some other means, or for decoding or other further processing by the same device 1.

Now, the operation of the elements of the first encoder 8 will be described in more detail. For illustrative purposes the left and right input channels are denoted as l and r, respectively. Both of the channels are windowed in the first windowing element 8.1 with overlapping windows as defined, for example, by the equation (1a), (1b), (1c) or (1d). In the equations (1a), (1b), (1c) and (1d) the center window, which in this example embodiment is the sinusoidal window

$$win_c(t) = \sin\left(\frac{\pi}{L}\left(t+\frac{1}{2}\right)\right), \quad t = 0,\ldots,L-1, \qquad (2)$$

fulfils the following condition: win_(c)(t)²+win_(c)(t+L/2)²=1.
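This overlap-add condition can be verified numerically; a minimal sketch, assuming an even window length L:

```python
import numpy as np

L = 320                                               # assumed even
win_c = np.sin(np.pi / L * (np.arange(L) + 0.5))      # equation (2)
t = np.arange(L // 2)
# win_c(t)^2 + win_c(t + L/2)^2 = 1 guarantees that overlapping windowed
# frames sum back to the original signal level.
assert np.allclose(win_c[t] ** 2 + win_c[t + L // 2] ** 2, 1.0)
```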

Let L(k) and R(k), k=0, . . . , W−1 be the discrete Fourier transform (DFT) domain coefficients of the current windowed left and right input frames, respectively. W is the length of the transform and is defined by the window length. The coefficients are symmetric around the index k_(m)=W/2, such that for example L(k_(m)+k)=conj(L(k_(m)−k)), where conj denotes the complex conjugate of the transform coefficient. From now on the discussion concentrates only on the first k_(m)+1 transform coefficients.

After the windowed samples of both channels have been transformed from the time domain to the transform domain by the time-to-frequency domain transformer 8.2, the discrete Fourier transform domain channels may be divided into subbands by the subband divider 8.3. The subbands can be uniform, i.e. each subband is equal in bandwidth, or non-uniform, for example in such a way that at low frequencies the subbands are narrower and at higher frequencies wider. The subbands do not have to cover the whole frequency range; only a subset of the frequency range may be covered. For example, in some embodiments of the invention it may be considered sufficient that the lowest 2 kHz of the full frequency range is covered.

Let us denote the boundary indexes of the B subbands as k_(b), b=1, . . . , B+1. Now for example the b-th subband of the right channel can be denoted as R_(b)(k)=R(k_(b)+k), where k=0, . . . , k_(b+1)−k_(b)−1.
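The indexing convention can be made concrete with a short sketch; the boundary values below are purely illustrative and not taken from the described embodiment:

```python
import numpy as np

# Illustrative boundary indexes k_b, b = 1, ..., B+1, for B = 5 non-uniform
# subbands: narrower at low frequencies, wider at high frequencies.
k_bounds = np.array([0, 8, 20, 40, 80, 161])

def subband(X, b, k_bounds):
    """Return the b-th subband X_b(k) = X(k_b + k), k = 0, ..., k_{b+1}-k_b-1."""
    return X[k_bounds[b - 1]:k_bounds[b]]
```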

The leading channel selector 8.4 may select for each band one of the input channel audio signals as the "leading" channel. In an example embodiment of the invention, the leading channel selector 8.4 tries to determine in which channel the signal is leading the other channel(s), i.e. in which channel a certain feature of the signal occurs first. This may be performed for example by calculating a correlation between the two channels and using the correlation result to determine the leading channel. The leading channel selector 8.4 may also select the channel with the highest energy as the leading channel. In other embodiments, the leading channel selector may select the channel according to a psychoacoustic modelling criterion. In other embodiments of the invention, the leading channel selector 8.4 may select the leading channel by selecting the channel which has on average the smallest delay. However, in an embodiment where there are only two input audio channels, they have the same delay in relation to each other with opposite signs. In some embodiments the leading channel may be a fixed channel; for example the first channel of the group of audio input channels may be selected to be the leading channel. Information on the leading channel may be delivered to the decoder 20 e.g. by encoding the information and providing it for the decoder along with the encoded audio data.

The selection of the leading channel may be made from analysis frame to analysis frame according to predefined criteria.

One or more of the other channel(s), i.e. the non-leading channel(s), can be called a trailing channel.

Corresponding subbands of the right and left channels are analyzed by the analysis block 8.5 to find the time difference (delay) between the channels. The delay is searched for, for example, by determining a set of shifted signals for a subband of a first channel, each shifted signal corresponding to a delay value in a set of different delays, and for each shifted signal calculating a dot product between the shifted signal and the respective signal of a second channel, in order to determine a set of dot products associated with the respective delay values. A subband R_(b)(k) can be shifted d samples in the time domain using

$$R_b^d(k) = R_b(k)\,\exp\left(\frac{-j\,2\pi\,d\,(k+k_b)}{W}\right), \qquad (3)$$

in which j is the imaginary unit, positive values of the delay d shift the time domain (subband) signal d samples to the left (earlier in time), and negative values of the delay d shift the time domain (subband) signal |d| samples to the right (later in time), respectively. The shifting does not change the absolute values of the frequency domain parameters; only the phases are modified. Now the task is to find the delay d which maximizes the dot product between the complex conjugates of the set of shifted frequency-domain subband signals of the right channel and the respective (non-shifted) signals of the left channel

$$\max_{d}\ \mathrm{real}\!\left(\sum_{k=0}^{k_{b+1}-k_b-1} \overline{R}_b^{\,d}(k)\,L_b(k)\right), \quad d \in \left[-D_{\max},\, D_{\max}\right] \qquad (4)$$

where R̄_(b)^(d)(k) is the complex conjugate of R_(b)^(d)(k) and real( ) indicates the real part of the complex-valued result. Only the real part of the dot product is used, as it measures the similarity without any phase shifts. As an alternative, equation (4) may be modified in such a way that the real part of the dot products between the set of shifted frequency-domain subband signals of the right channel and the complex conjugates of the respective signals of the left channel is determined. With these computations the optimal shift d_(b) for the current subband b is found. Information on the delay d_(b) for the subband is also provided to a decoder 20. To keep the bit rate low, the set of allowed values for the delay d_(b) may be limited.
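A possible frequency-domain realization of equations (3) and (4) is sketched below; the exhaustive search over integer delays and the function names are illustrative assumptions:

```python
import numpy as np

def shift_subband(R_b, d, k_b, W):
    """Equation (3): delay the subband by d samples with a DFT phase ramp.
    Only the phases change; the magnitudes stay intact."""
    k = np.arange(len(R_b))
    return R_b * np.exp(-2j * np.pi * d * (k + k_b) / W)

def find_delay(L_b, R_b, k_b, W, d_max):
    """Equation (4): choose d in [-d_max, d_max] maximizing the real part of
    the dot product between conj(shifted right subband) and the left subband."""
    best_d, best_score = 0, -np.inf
    for d in range(-d_max, d_max + 1):
        score = np.real(np.sum(np.conj(shift_subband(R_b, d, k_b, W)) * L_b))
        if score > best_score:
            best_d, best_score = d, score
    return best_d
```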

For example, at the highest frequencies it may not always be perceptually reasonable to modify the signals if they are not considered similar enough.

The strength of the similarity may be measured for example using the following equation:

$$W_b = \frac{\mathrm{real}\!\left(\sum_{k=0}^{k_{b+1}-k_b-1} \overline{R}_b^{\,d_b}(k)\,L_b(k)\right)}{\sum_{k=0}^{k_{b+1}-k_b-1} \left|R_b^{d_b}(k)\right|\left|L_b(k)\right|}. \qquad (5)$$

If W_(b) is smaller than a predefined threshold for the subband b, the delay d_(b) is set to zero. In general, the thresholds may be subband dependent and/or may vary from frame to frame. As an example, lower thresholds may be used for subbands of higher frequencies.
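A sketch of the similarity gate of equation (5); the threshold value and function names are assumptions for illustration:

```python
import numpy as np

def similarity(L_b, R_b_shifted):
    """Equation (5): normalized similarity W_b of the aligned subbands."""
    num = np.real(np.sum(np.conj(R_b_shifted) * L_b))
    den = np.sum(np.abs(R_b_shifted) * np.abs(L_b))
    return num / den if den > 0.0 else 0.0

def gate_delay(d_b, L_b, R_b_shifted, threshold=0.5):
    """Keep the delay only if the aligned subbands are similar enough."""
    return d_b if similarity(L_b, R_b_shifted) >= threshold else 0
```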

According to an example embodiment of the present invention, the channel in which a feature of the input signal appears first is not modified in the current subband. This implies that when time aligning the signals, no signal should ever be shifted later in time (delayed). This is perceptually motivated by the fact that the channel (subband) in which things happen first is perceptually more important and typically also contains more energy than the other channel(s). Since in the above example the optimal shift is searched for the right channel as shown in equations (3) and (4), the following logic can be used:

If the delay d_(b) for the current subband b is greater than 0, then

-   Shift the transform coefficients of the right channel R_(b)(k) by d_(b) samples using the equation (3), k=0, . . . , k_(b+1)−k_(b)−1;

else

-   Shift the transform coefficients of the left channel L_(b)(k) by −d_(b) samples using the equation (3), k=0, . . . , k_(b+1)−k_(b)−1.

However, in some implementations it may be possible to apply shifting to delay the leading channel instead of, or in addition to, the shifting of the trailing channel.

The delay analysis and the shifting are performed independently for every subband. After a certain amount of the subbands or all subbands have been analysed and modified, the aligned DFT domain signals L′(k) and R′(k) have been obtained, which are then transformed to the time domain by the frequency-to-time domain transformer 8.6. In the time domain, the signals are again windowed by the second windowing element 8.7, which uses the window win(t) to remove perceptual artefacts outside the central part of the window. Finally the overlapping parts of the successive frames are combined, e.g. added together, to obtain the aligned time domain signals l′ and r′.

Next, the decoding of the encoded audio signals will be described in more detail with reference to the FIGS. 1 a, 1 b and 3. The decoding may be performed by the same device 1 which has the encoder 8, or by another device 19 which may or may not have the encoder 8 of the present invention.

The device 19 receives 22 the encoded audio signal. In the device 19 the first decoder 21, as illustrated in FIGS. 1 a and 1 b, decodes the encoded left and right channels, wherein the decoded channels {circumflex over (l)}′ and {circumflex over (r)}′ are obtained and input to the second decoder 20. In the second decoder 20 the third windowing element 20.1 performs windowing similar to the windowing used in the first encoder 8. The windowing results are transformed from the time domain to the frequency domain by the second time-to-frequency domain transformer 20.2. After the DFT transform the frequency domain signals {circumflex over (L)}′(k) and {circumflex over (R)}′(k) have been obtained. Now, the decoded delay values d_(b) are obtained from the encoded data. The inverse of the signal modification of the encoder is now performed, i.e. the delay between the signals will be restored by the delay insertion block 20.3. The delay insertion block 20.3 uses, for example, the following logic:

If the delay d_(b) for the current subband b is greater than 0, then

-   Shift the transform coefficients of the right channel R′_(b)(k) by −d_(b) samples using equation (3), k=0, . . . , k_(b+1)−k_(b)−1;

else

-   Shift the transform coefficients of the left channel L′_(b)(k) by d_(b) samples using equation (3), k=0, . . . , k_(b+1)−k_(b)−1.

As a result, the transform coefficients of the left and right channels {circumflex over (L)}(k) and {circumflex over (R)}(k) are obtained, which are transformed to the time domain by the inverse discrete Fourier transform (IDFT) block 20.4, windowed by the fourth windowing element 20.5, and combined by overlap-add with the other frames by the second combiner 20.6. The digital samples can now be converted to an analogue signal by the digital-to-analogue converter 12 and transformed into audible signals by the loudspeakers 13, for example.
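The windowed overlap-add performed by the combiners can be sketched as follows; the hop of L/2 samples follows the frame structure described above, while the function name and frame container are assumptions:

```python
import numpy as np

def overlap_add(frames, hop):
    """Combine equally long windowed synthesis frames: each frame is added
    into the output hop samples after the previous one."""
    W = len(frames[0])
    out = np.zeros(hop * (len(frames) - 1) + W)
    for i, frame in enumerate(frames):
        out[i * hop : i * hop + W] += frame
    return out

# Illustrative use: frames of length W = L + 2*D_max, windowed by win(t)
# after the IDFT, combined with a hop of L // 2 samples.
```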

The above description revealed some general concepts of the present invention. In the following, some details of the core encoder, i.e. the second encoder 9, and the core decoder, i.e. the first decoder 21, will be described.

It is possible to perform core stereo coding for the time aligned signals l′ and r′ totally independently of the binaural coding performed by the first encoder 8. This makes the implementation very flexible for all kinds of core coding methods. For example, common mid-side or parametric stereo coders can be used.

Another possibility is to integrate the stereo coding part into the binaural codec, wherein the second encoder 9 and the first decoder 21 are not needed. Both mid-side and parametric coding are in principle possible also in this case. In integrated coding one possibility is to do all the encoding in the frequency domain. An example embodiment of this is depicted in FIG. 5, in which similar reference numerals are used for the corresponding elements as in the embodiment of FIG. 3.

In binaural parametric coding the levels of the original signals are analyzed in the first encoder 8 and the information is submitted to the second decoder 20, either in the form of energy levels or as scaling factors. Example embodiments of both of these methods are introduced here.

The DFT domain representation is divided into C energy subbands which cover the whole frequency band of the signal to be encoded. The boundary indexes of the subbands can be denoted as k_(c), c=1, . . . , C+1. It should be noticed that these subbands do not have to be the same subbands as used for the delay analysis. Now for example the c-th subband of the right channel can be denoted as R_(c)(k)=R(k_(c)+k), where k=0, . . . , k_(c+1)−k_(c)−1.

For both channels and for all energy subbands, the energies are calculated as

$$E_X(c) = \log_{10}\!\left(\frac{\sum_{k=0}^{k_{c+1}-k_c-1} \left|X_c(k)\right|^2}{k_{c+1}-k_c-1}\right), \qquad (6)$$

where X denotes either R or L. If it is selected that the energy values are submitted to the decoder, the energies are quantized to Ê_(X)(c). Notice that the energies may be estimated for example from R(k) or R′(k), since the magnitudes do not change in the delay modification procedure. The total number of energy parameters is 2C, as the energies are calculated separately for both channels.

The time aligned left and right channel signals L′(k) and R′(k) are combined to form a mono signal, for example by determining a sum of the left and right channels:

M′(k)=(L′(k)+R′(k))/2  (7)

In some embodiments of the invention, the mono signal can also be calculated in the time domain. Now it is possible, as an alternative method, to compute gain values for the energy subbands

$$G_X(c) = \log_{10}\!\left(\sqrt{\frac{\sum_{k=0}^{k_{c+1}-k_c-1} \left|X_c(k)\right|^2}{\sum_{k=0}^{k_{c+1}-k_c-1} \left|M'_c(k)\right|^2}}\right), \qquad (8)$$

where M′(k) has been divided into energy subbands similarly as X(k). G_(X)(c) is quantized into Ĝ_(X)(c) and submitted to the second decoder 20. The logarithmic domain representation is used based on the properties of human perception.
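A sketch of the level analysis of equations (6) and (8); the normalisation follows the equations as written above, and the function names are assumptions:

```python
import numpy as np

def subband_energy(X_c):
    """Equation (6): log-domain energy E_X(c) of one energy subband."""
    n = len(X_c)                      # number of bins, k_{c+1} - k_c
    return np.log10(np.sum(np.abs(X_c) ** 2) / (n - 1))

def subband_gain(X_c, M_c):
    """Equation (8): log-domain gain G_X(c) of a channel subband relative
    to the corresponding mono downmix subband M'_c(k)."""
    return np.log10(np.sqrt(np.sum(np.abs(X_c) ** 2) /
                            np.sum(np.abs(M_c) ** 2)))
```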

The mono signal m′(t) (the time domain equivalent of M′(k)) is encoded with a mono encoder as presented in FIG. 1 b. In the decoder a synthesized mono signal {circumflex over (m)}′(t) will be obtained, which is windowed and transformed to the frequency domain to produce the frequency domain representation {circumflex over (M)}′(k) of the synthesized mono signal. Next, the frequency domain left channel signal {circumflex over (L)}′(k) and the right channel signal {circumflex over (R)}′(k) are obtained from {circumflex over (M)}′(k) with a scaling operation, which is performed separately for every energy subband and for both channels. In the case of quantized energy values the scaled signals are obtained as

$$\hat{X}'_c(k) = \frac{10^{\hat{E}_X(c)}\,\left(k_{c+1}-k_c-1\right)}{\sum_{k=0}^{k_{c+1}-k_c-1} \left|\hat{M}'_c(k)\right|^2}\,\hat{M}'_c(k), \quad k = 0,\ldots,k_{c+1}-k_c-1, \qquad (9)$$

If the scaling factors G_(X) were used, the scaled signals are obtained simply as

$$\hat{X}'_c(k) = 10^{\hat{G}_X(c)}\,\hat{M}'_c(k), \qquad (10)$$

In both equations (9) and (10) the notation X is either L for the left channel or R for the right channel. Equations (9) and (10) simply return the energy of the subband to the original level. After this has been performed for all energy subbands in both channels, processing can be continued by the delay insertion block 20.3 for returning the delays to their original values.
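The decoder-side scaling can be sketched in the same style; the functions below mirror equations (9) and (10) as written, with the quantised parameters passed in as plain floats:

```python
import numpy as np

def scale_with_energy(M_c, E_hat):
    """Equation (9): rescale the decoded mono subband using the quantised
    energy value E_hat of the target channel."""
    n = len(M_c)
    factor = 10.0 ** E_hat * (n - 1) / np.sum(np.abs(M_c) ** 2)
    return factor * M_c

def scale_with_gain(M_c, G_hat):
    """Equation (10): rescale the decoded mono subband using the quantised
    gain value G_hat of the target channel."""
    return 10.0 ** G_hat * M_c
```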

Depending on the core coding method, the spatial ambience of the decoded binaural signal as perceived by the user may shrink compared to the original signal. This means that even though the directions of the sounds are correct, the ambience around the listener may not sound genuine. This may be because the two channels are so similar that sounds do not perceptually externalize from the inside of the listener's head. This is especially typical when a parametric representation of the spatial signal is used. This holds both in the case when parametric stereo coding has been integrated with the binaural codec (FIG. 1 b) and when parametric stereo coding is used outside the actual binaural coding part (FIG. 1 a).

Externalization can be improved by means of decorrelation processing. The need for the decorrelation processing can be estimated for example using a coherence analysis of the input signals:

$$H = \frac{\sum_{k=N_1}^{N_2} \left|L(k)\right|\left|R(k)\right|}{\sqrt{\left(\sum_{k=N_1}^{N_2} \left|L(k)\right|^2\right)\left(\sum_{k=N_1}^{N_2} \left|R(k)\right|^2\right)}}, \qquad (11)$$

where N₁ and N₂ define the frequency region from which the similarity is measured. The value of H varies in the range [0, 1]; the smaller the value, the less similarity there is between the channels and the stronger the need for decorrelation. The value of H may be quantized to Ĥ and submitted to the decoder.

In one embodiment of the invention, an all-pass type of decorrelation filter may be applied to the synthesized binaural signals. The used filter is of the form

$$D(z) = \frac{\alpha + z^{-P}}{1 + \alpha\,z^{-P}}, \qquad (12)$$

where P is set to a fixed value, for example 50 samples for a 32 kHz signal. The parameter α is used such that it has opposite values for the two channels; for example the values 0.4 and −0.4 can be used for the left and right channels, respectively. This maximizes the perceptual decorrelation effect.

The synthesized (output) signal is now obtained as

$$\tilde{L}(z) = \beta_1\,z^{-P_D}\,\hat{L}(z) + \beta_2\,D(z)\,\hat{L}(z)$$

$$\tilde{R}(z) = \beta_1\,z^{-P_D}\,\hat{R}(z) + \beta_2\,D(z)\,\hat{R}(z) \qquad (13)$$

where β₁ and β₂ are scaling factors obtained as a function of Ĥ, for example β₁=Ĥ and β₂=1−β₁. Alternatively it is for example possible to select β₁ and β₂ such that in the case of two independent signals the energy level is maintained, i.e. β₁=Ĥ and β₁²+β₂²=1. In equation (13), P_D is the average group delay of the decorrelation filter of equation (12).
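The coherence analysis and the decorrelating synthesis of equations (11)-(13) could be realized along the following lines; scipy's lfilter is used for the all-pass of equation (12), and the group delay value P_D below is an illustrative assumption:

```python
import numpy as np
from scipy.signal import lfilter

def coherence(L_k, R_k, n1, n2):
    """Equation (11): coherence H in [0, 1] over DFT bins n1..n2."""
    Ls, Rs = np.abs(L_k[n1:n2 + 1]), np.abs(R_k[n1:n2 + 1])
    return np.sum(Ls * Rs) / np.sqrt(np.sum(Ls ** 2) * np.sum(Rs ** 2))

def decorrelate(x, alpha, P=50):
    """Equation (12): all-pass D(z) = (alpha + z^-P) / (1 + alpha*z^-P)."""
    b = np.zeros(P + 1); b[0], b[-1] = alpha, 1.0    # numerator coefficients
    a = np.zeros(P + 1); a[0], a[-1] = 1.0, alpha    # denominator coefficients
    return lfilter(b, a, x)

def synthesize(x, alpha, beta1, beta2, P=50, P_D=50):
    """Equation (13): mix the direct path, delayed by the average group
    delay P_D of D(z) (an assumed value here), with the decorrelated path."""
    direct = np.concatenate([np.zeros(P_D), x[:-P_D]]) if P_D else x
    return beta1 * direct + beta2 * decorrelate(x, alpha, P)

# Opposite signs of alpha for the two channels maximize the effect, e.g.
# left = synthesize(l_hat, 0.4, b1, b2); right = synthesize(r_hat, -0.4, b1, b2)
```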

In true binaural recordings (recorded with an artificial head or using a headset with microphones in the ears) a typical situation is that β₂ is set to 1 and β₁ to zero. If the binaural signal has been generated artificially, for example by using head related transfer functions (HRTF) or head related impulse responses (HRIR), it is typical that the channels are strongly correlated, and β₁ is set to one and β₂ to zero. If the value of Ĥ changes from one frame to another, for example linear interpolation can be used for the β₁ and β₂ values between the time domain instants when the parameters are updated, such that there are no sudden changes in the values.

The usage of the scaling factors for decorrelation also depends on the properties of the core codec. For example, if a mono core codec is used, strong decorrelation may be needed. In the case of a parametric stereo core codec, the need for decorrelation may be at an average level. When mid-side stereo is used as the core codec, there may not be a need for decorrelation, or only a mild decorrelation may be used.

In the above, one possible implementation of a binaural codec was presented. However, it is obvious that there can be numerous different implementations with slightly different operations. For example, the Shifted Discrete Fourier Transform (SDFT) can be used instead of the DFT. This enables, for example, a direct transform from the SDFT domain to the MDCT domain in the encoder, which enables an efficient implementation with low delay.

In the above described implementation there are D_(max) zeroes at at least one end of the window. However, it is possible to have an embodiment of the invention without using these zeroes; in that case the samples are cyclically shifted from one end of the window to the other when the time shift is applied. This may result in compromised audio quality, but on the other hand the computational complexity is slightly lower due to, inter alia, shorter transforms and less overlap.

The central part of the proposed window does not have to be a sinusoidal window as long as the condition mentioned after the equation (2) is fulfilled. Different techniques can be used for computing the energies (equation (6)) without essentially changing the main idea of the invention. It is also possible to calculate and quantize gain values instead of energy levels.

There are also several possibilities for calculating the need for decorrelation (equation (11)) as well as for implementing the actual decorrelation filter.

The delay estimation may also be recursive, wherein the analysis block 8.5 first uses a coarser resolution in the search and, after approaching the correct delay, the analysis block 8.5 can use a finer resolution in the search to make the delay estimate more accurate.

It is also possible that the first encoder 8 does not align the signals of the different channels but only determines the delay and informs the second decoder 20 of it, wherein the combined signal is provided for decoding without delaying. In this embodiment the second decoder 20 delays the signal of the other channel.

As used in this application, the term 'circuitry' refers to all of the following:

(a) to hardware-only circuit implementations (such as implementations in only analog and/or digital circuitry);

(b) to combinations of circuits and software (and/or firmware), such as: (i) to a combination of processor(s) or (ii) to portions of processor(s)/software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone, a server, a computer, a music player, an audio recording device, etc., to perform various functions; and

(c) to circuits, such as a microprocessor(s) or a portion of a microprocessor(s), that require software or firmware for operation, even if the software or firmware is not physically present.

This definition of ‘circuitry’ applies to all uses of this term in this application, including in any claims. As a further example, as used in this application, the term “circuitry” would also cover an implementation of merely a processor (or multiple processors) or a portion of a processor and its (or their) accompanying software and/or firmware. The term “circuitry” would also cover, for example and if applicable to the particular claim element, a baseband integrated circuit or applications processor integrated circuit for a mobile phone, or a similar integrated circuit in a server, a cellular network device, or other network device.

The invention is not solely limited to the above described embodiments, but it can be varied within the scope of the appended claims.

1. A method comprising using samples of at least a part of an audio signal of a first channel and a part of an audio signal of a second channel to estimate a time delay between said part of the audio signal of said first channel and said part of the audio signal of said second channel; windowing the samples of said first channel and said second channel by a window function to form an analysis frame of said first channel and an analysis frame of said second channel; performing a time-to-frequency domain transform on the analysis frames to form a frequency domain representation of said part of the audio signal of said first channel and said part of the audio signal of said second channel; and determining an inter-channel time delay between said part of the audio signal of the first channel and said part of the audio signal of said second channel on the basis of the frequency domain representations.
2. The method according to claim 1, wherein said window function comprises a first window and a set of predetermined values at least at one end of the first window, wherein said predetermined values are zeros.
 3. (canceled)
4. The method according to claim 2, wherein said window function is ${win}(t) = \left\{ \begin{matrix} 0, & t = 0,\ldots,D_{\max} - 1 \\ {win}_{c}(t - D_{\max}), & t = D_{\max},\ldots,D_{\max} + L - 1 \\ 0, & t = D_{\max} + L,\ldots,L + 2D_{\max} - 1 \end{matrix} \right.$ where D_max is the maximum shift allowed, win_c(t) is the first window and L is the length of the first window.

5. The method according to claim 1, wherein said determining comprises: shifting the frequency domain representation of the second channel to represent a delayed audio signal of the second channel; defining a dot product between the frequency domain representation of the first channel and complex conjugate values of the shifted frequency domain representation of the second channel; and determining the inter-channel time delay as a value for the shift which maximizes a real value of the dot product.
 6. (canceled)
7. The method according to claim 5, wherein said determining comprises: dividing the frequency domain representations into a number of subbands; and performing the delay estimation at at least one subband of said number of subbands.
8. The method according to claim 1, further comprising time aligning the first channel and the second channel to compensate for the determined inter-channel delay, wherein said time aligning comprises shifting the second channel in relation to the determined inter-channel delay.

9. (canceled)
10. The method according to claim 7, wherein the method comprises: searching similarities within signals of the first channel and the second channel at each subband; and time aligning the first channel and the second channel only on such subbands in which said searching similarities indicates that the signal of the first channel and the signal of the second channel can be considered similar enough.

11. The method according to claim 10, wherein said searching similarities comprises defining a dot product between the frequency domain representation of the first channel and complex conjugate values of the shifted frequency domain representation of the second channel; finding a value for the shift which maximizes a real value of the dot product; and comparing the maximum of the real value of the dot product with a threshold to determine whether the signal of the first channel and the signal of the second channel can be considered similar enough at the subband.
12. The method according to claim 10, wherein said searching similarities comprises defining a correlation between the frequency domain representation of the first channel and complex conjugate values of the shifted frequency domain representation of the second channel; finding a value for the shift which maximizes the correlation; and comparing the correlation with a threshold to determine whether the signal of the first channel and the signal of the second channel can be considered similar enough at the subband.
13. (canceled)

14. (canceled)
15. The method according to claim 5, wherein a set of shift values are defined, wherein the method comprises selecting the shift from said set of shift values to determine the inter-channel time delay.
 16. (canceled)
 17. (canceled)
 18. (canceled)
19. (canceled)

20. The method according to claim 1, wherein the method comprises determining a need for decorrelation between said audio signal of the first channel and said audio signal of the second channel; and providing an indication of the need for decorrelation.
21. A method comprising receiving an encoded audio signal of a first channel and an encoded audio signal of a second channel; receiving an indication of an inter-channel time delay between said encoded audio signal of the first channel and said encoded audio signal of the second channel; decoding said encoded audio signal of the first channel and said encoded audio signal of the second channel to form decoded samples of the audio signal of the first channel and the audio signal of the second channel; performing a time-to-frequency domain transform on the windowed samples to form a frequency domain representation of said audio signal of said first channel and said audio signal of said second channel; shifting the frequency domain representation of one of said audio signal of said first channel and said audio signal of said second channel on the basis of said indication; performing a frequency-to-time domain transform on the frequency domain representation of said audio signal of said first channel and said audio signal of said second channel to form decoded samples of the audio signal of the first channel and of the audio signal of the second channel; and windowing said decoded samples of said first channel and said second channel by a window function to form a synthesized audio signal of the first channel and a synthesized audio signal of the second channel.
22. The method according to claim 21, wherein said window function comprises a first window and a set of predetermined values at least at one end of the first window, wherein said predetermined values are zeros.
 23. (canceled)
24. The method according to claim 22, wherein said window function is ${win}(t) = \left\{ \begin{matrix} 0, & t = 0,\ldots,D_{\max} - 1 \\ {win}_{c}(t - D_{\max}), & t = D_{\max},\ldots,D_{\max} + L - 1 \\ 0, & t = D_{\max} + L,\ldots,L + 2D_{\max} - 1 \end{matrix} \right.$ where D_max is the maximum shift allowed, win_c(t) is the first window and L is the length of the first window.

25. The method according to claim 21, wherein the method comprises receiving an indication of a need for decorrelation between the audio signal of the first channel and the audio signal of the second channel; if the indication indicates the need for decorrelation, decorrelating the synthesized audio signal of the first channel and the synthesized audio signal of the second channel.
26. The method according to claim 21, wherein the method comprises: dividing the frequency domain representations into a number of subbands; and performing the shifting for at least one subband of said number of subbands.
27. (canceled)

28. (canceled)
 29. (canceled)
 30. (canceled)
31. An apparatus comprising means for using samples of at least a part of an audio signal of a first channel and a part of an audio signal of a second channel to estimate a time delay between said part of the audio signal of said first channel and said part of the audio signal of said second channel; means for windowing the samples of said first channel and said second channel by a window function to form an analysis frame of said first channel and an analysis frame of said second channel; means for performing a time-to-frequency domain transform on the analysis frames to form a frequency domain representation of said part of the audio signal of said first channel and said part of the audio signal of said second channel; and means for determining an inter-channel time delay between said part of the audio signal of the first channel and said part of the audio signal of said second channel on the basis of the frequency domain representations.
32. The apparatus according to claim 31, wherein said window function comprises a first window and a set of predetermined values at least at one end of the first window, wherein said predetermined values are zeros.
 33. (canceled)
34. The apparatus according to claim 32, wherein said window function is ${win}(t) = \left\{ \begin{matrix} 0, & t = 0,\ldots,D_{\max} - 1 \\ {win}_{c}(t - D_{\max}), & t = D_{\max},\ldots,D_{\max} + L - 1 \\ 0, & t = D_{\max} + L,\ldots,L + 2D_{\max} - 1 \end{matrix} \right.$ where D_max is the maximum shift allowed, win_c(t) is the first window and L is the length of the first window.

35. The apparatus according to claim 31, wherein said means for determining are configured for: shifting the frequency domain representation of the second channel to represent a delayed audio signal of the second channel; defining a dot product between the frequency domain representation of the first channel and complex conjugate values of the shifted frequency domain representation of the second channel; and determining the inter-channel time delay as a value for the shift which maximizes a real value of the dot product.
 36. (canceled)
37. The apparatus according to claim 35, wherein said means for determining are configured for: dividing the frequency domain representations into a number of subbands; and performing the delay estimation at at least one subband of said number of subbands.
38. The apparatus according to claim 31, further comprising means for time aligning the first channel and the second channel to compensate for the determined inter-channel delay, wherein said means for time aligning are configured for shifting the second channel in relation to the determined inter-channel delay.

39. (canceled)
40. The apparatus according to claim 37, wherein the apparatus comprises: means for searching similarities within signals of the first channel and the second channel at each subband; and said means for time aligning are configured for time aligning the first channel and the second channel only on such subbands in which said searching similarities indicates that the signal of the first channel and the signal of the second channel can be considered similar enough.
41. The apparatus according to claim 40, wherein said means for searching similarities are configured for: defining a dot product between the frequency domain representation of the first channel and complex conjugate values of the shifted frequency domain representation of the second channel; finding a value for the shift which maximizes a real value of the dot product; and comparing the maximum of the real value of the dot product with a threshold to determine whether the signal of the first channel and the signal of the second channel can be considered similar enough at the subband.
42. The apparatus according to claim 40, wherein said means for searching similarities are configured for: defining a correlation between the frequency domain representation of the first channel and complex conjugate values of the shifted frequency domain representation of the second channel; finding a value for the shift which maximizes the correlation; and comparing the correlation with a threshold to determine whether the signal of the first channel and the signal of the second channel can be considered similar enough at the subband.
43. The apparatus according to claim 40, wherein a set of shift values are defined, wherein the apparatus comprises means for selecting the shift from said set of shift values to determine the inter-channel time delay.
 44. (canceled)
 45. (canceled)
46. (canceled)

47. (canceled)
48. The apparatus according to claim 31, wherein the apparatus comprises means for determining a need for decorrelation between said audio signal of the first channel and said audio signal of the second channel; and means for providing an indication of the need for decorrelation.
49. An apparatus comprising means for receiving an encoded audio signal of a first channel and an encoded audio signal of a second channel; means for receiving an indication of an inter-channel time delay between said encoded audio signal of the first channel and said encoded audio signal of the second channel; means for decoding said encoded audio signal of the first channel and said encoded audio signal of the second channel to form decoded samples of the audio signal of the first channel and the audio signal of the second channel; means for performing a time-to-frequency domain transform on the windowed samples to form a frequency domain representation of said audio signal of said first channel and said audio signal of said second channel; means for shifting the frequency domain representation of one of said audio signal of said first channel and said audio signal of said second channel on the basis of said indication; means for performing a frequency-to-time domain transform on the frequency domain representation of said audio signal of said first channel and said audio signal of said second channel to form decoded samples of the audio signal of the first channel and of the audio signal of the second channel; and means for windowing said decoded samples of said first channel and said second channel by a window function to form a synthesized audio signal of the first channel and a synthesized audio signal of the second channel.
50. The apparatus according to claim 49, wherein said window function comprises a first window and a set of predetermined values at least at one end of the first window, wherein said predetermined values are zeros.
51. (canceled)

52. The apparatus according to claim 50, wherein said window function is ${win}(t) = \left\{ \begin{matrix} 0, & t = 0,\ldots,D_{\max} - 1 \\ {win}_{c}(t - D_{\max}), & t = D_{\max},\ldots,D_{\max} + L - 1 \\ 0, & t = D_{\max} + L,\ldots,L + 2D_{\max} - 1 \end{matrix} \right.$ where D_max is the maximum shift allowed, win_c(t) is the first window and L is the length of the first window.

53. The apparatus according to claim 49, wherein the apparatus comprises means for receiving an indication of a need for decorrelation between the audio signal of the first channel and the audio signal of the second channel; and means for decorrelating the synthesized audio signal of the first channel and the synthesized audio signal of the second channel, if the indication indicates the need for decorrelation.
54. The apparatus according to claim 49, wherein the apparatus comprises: means for dividing the frequency domain representations into a number of subbands; and means for performing the shifting for at least one subband of said number of subbands.
 55. (canceled)
 56. (canceled)
57. A computer program product comprising a computer program code configured to, with at least one processor, cause an apparatus to: use samples of at least a part of an audio signal of a first channel and a part of an audio signal of a second channel to estimate a time delay between said part of the audio signal of said first channel and said part of the audio signal of said second channel; window the samples of said first channel and said second channel by a window function to form an analysis frame of said first channel and an analysis frame of said second channel; perform a time-to-frequency domain transform on the analysis frames to form a frequency domain representation of said part of the audio signal of said first channel and said part of the audio signal of said second channel; and determine an inter-channel time delay between said part of the audio signal of the first channel and said part of the audio signal of said second channel on the basis of the frequency domain representations.

58. A computer program product comprising a computer program code configured to, with at least one processor, cause an apparatus to: receive an encoded audio signal of a first channel and an encoded audio signal of a second channel; receive an indication of an inter-channel time delay between said encoded audio signal of the first channel and said encoded audio signal of the second channel; decode said encoded audio signal of the first channel and said encoded audio signal of the second channel to form decoded samples of the audio signal of the first channel and the audio signal of the second channel; perform a time-to-frequency domain transform on the windowed samples to form a frequency domain representation of said audio signal of said first channel and said audio signal of said second channel; shift the frequency domain representation of one of said audio signal of said first channel and said audio signal of said second channel on the basis of said indication; perform a frequency-to-time domain transform on the frequency domain representation of said audio signal of said first channel and said audio signal of said second channel to form decoded samples of the audio signal of the first channel and of the audio signal of the second channel; and window said decoded samples of said first channel and said second channel by a window function to form a synthesized audio signal of the first channel and a synthesized audio signal of the second channel.